Increasing Quality of the Corpus of Frequency Dictionary of Contemporary Polish for Morphosyntactic Tagging of the Polish Language

Authors

Marcin Kuta

Pawel Chrzaszcz

Jacek Kitowski

Abstract

The paper addresses the correction of the erroneous and ambiguous corpus of the Frequency Dictionary of Contemporary Polish (FDCP) and its application to morphosyntactic tagging of Polish. It presents several stages of corpus transformation and evaluates baseline part-of-speech tagging algorithms.

Similar resources

Harnessing the CRF Complexity with Domain-Specific Constraints. The Case of Morphosyntactic Tagging of a Highly Inflected Language

We describe a domain-specific method of adapting conditional random fields (CRFs) to morphosyntactic tagging of highly-inflectional languages. The solution involves extending CRFs with additional, position-wise restrictions on the output domain, which are used to impose consistency between the modeled label sequences and morphosyntactic analysis results both at the level of decoding and, more i...

Automatic Detection of Annotation Errors in Polish-Language Corpora

In this article we propose an extension to the variation n-gram based method of detecting annotation errors. We also show an approach to finding anomalies in the morphosyntactic annotation layer by using association rule discovery. As no research has previously been done in the field of morphosyntactic annotation error correction for Polish, we provide novel results based on experiments on the large...

Multi-source morphosyntactic tagging for spoken Rusyn

This paper deals with the development of morphosyntactic taggers for spoken varieties of the Slavic minority language Rusyn. As neither annotated corpora nor parallel corpora are electronically available for Rusyn, we propose to combine existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish and adapt them to Rusyn. Using MarMoT as tagging toolki...

Towards an LFG parser for Polish: An exercise in parasitic grammar development

While it is possible to build a formal grammar manually from scratch or, going to another extreme, to derive it automatically from a treebank, the development of the LFG grammar of Polish presented in this paper is different from both of these methods as it relies on extensive reuse of existing language resources for Polish. LFG grammars minimally provide two levels of representation: constitue...

PoliTa: A multitagger for Polish

Part-of-Speech (POS) tagging is a crucial task in Natural Language Processing (NLP). POS tags may be assigned to tokens in text manually, by trained linguists, or using algorithmic approaches. Particularly, in the case of annotated text corpora, the quantity of textual data makes it unfeasible to rely on manual tagging and automated methods are used extensively. The quality of such methods is o...

Journal:

Computing and Informatics

Volume 28

Publication date: 2009
